Abstract. In this paper, we review the publicly available datasets of
emotional speech and their usability for state-of-the-art speech synthesis.
This usability is conditioned by several characteristics of these datasets:
the quality of the recordings, the quantity of data, and the emotional
content captured in the data. We then present a dataset that was recorded
based on the needs observed in this area. It contains data for male and
female actors in English and a male actor in French. The database covers
5 emotion classes, so it could be suitable for building synthesis and voice
transformation systems with the potential to control the emotional
dimension.
1 Introduction

One of the major components of human-agent interaction systems is the speech
synthesis module. State-of-the-art speech synthesis systems such as WaveNet
and Tacotron [1-3] give impressive results. They can produce intelligible,
expressive, even human-like speech. However, they cannot yet be used to control
the emotional dimension of speech, which is a crucial parameter for obtaining
a human-like, controllable speech synthesis system.

Although still relatively neglected by the affective computing community,
interest in emotional speech synthesis systems has been growing for the past
two decades. After the improvements that parametric systems brought to this
field [4,5], deep learning-based systems were also employed for this task.

One of the problems in the emotional speech synthesis research community
is the lack of publicly available data and the difficulty of collecting it. In
fact, to the best of our knowledge, no publicly available emotional speech
database is both designed for synthesis and suitable for deep learning systems.
In this paper, we try to tackle this problem.

In what follows, we present a review of emotional speech datasets in
Section 2. We then describe the motivations for collecting a new database in
Section 3 and detail the content of a newly released database¹ that fulfills
these motivations in Section 4.


2 Review 

Emotions can be represented in different ways. A first representation is
Ekman’s six basic emotion model [6], which identifies anger, disgust, fear,
happiness, sadness and surprise as six basic emotions from which the other
emotions may be derived. Emotions can also be represented in a multidimensional
continuous space, as in Russell’s circumplex model [7] (valence and arousal
currently being the most widely used dimensions). A more recent way of
representing emotions is based on ranking, which annotates emotions through
relative preferences rather than labeling them with absolute values [8].

Several open-source databases can be found but, to the best of our knowledge,
none is really suitable for emotional speech synthesis. In this section we
explain why and mention some examples.

The RAVDESS database contains emotional data from 24 different actors [9]. The
actors were asked to read 2 different sentences, both spoken and sung, in North
American English. The spoken style was recorded in 8 different emotional styles:
neutral, calm, happy, sad, angry, fearful, disgust and surprise. Each utterance
was expressed at 2 different intensities (except for the neutral emotion) and
repeated 2 times, giving a total of 1440 files. A perception test was then
undertaken to validate the database in terms of emotional categories, intensity
and genuineness.
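
The reported total of 1440 files is consistent with the recording design
described above. The short check below is only a sketch: it assumes that every
actor recorded every combination of sentence, emotion, intensity and
repetition, and the variable names are ours, not part of RAVDESS.

```python
# Sanity check of the RAVDESS spoken-file count from the design described above.
# Assumption: every actor recorded every sentence/emotion/intensity/repetition
# combination; variable names are illustrative only.
actors = 24
sentences = 2
repetitions = 2
emotions_with_intensity = 7  # calm, happy, sad, angry, fearful, disgust, surprise
intensities = 2
neutral_conditions = 1       # neutral was recorded at a single intensity

# 7 emotions x 2 intensities, plus the single neutral condition
conditions_per_sentence = emotions_with_intensity * intensities + neutral_conditions
total_files = actors * sentences * repetitions * conditions_per_sentence
print(total_files)
```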

The CREMA-D database [10] is similar to RAVDESS. For this database,
12 different sentences were recorded by 91 different actors for 6 emotions:
happy, sad, anger, fear, disgust, and neutral. Only one of the 12 sentences
was expressed at 3 different intensities; for the other 11, the intensity was
not specified. The authors report 7442 files in total. This database was also
validated through perception tests, which helped validate the emotion category
and intensity.

Also similar to the previous ones, the GEMEP database [11] is a collection of
recordings of 10 French-speaking actors uttering 15 different emotional
expressions at three levels of intensity, in three different ways: improvised
sentences, pseudo-speech, and nonverbal affect bursts. This database contains
a total of 1260 audio files. It was also validated through perception tests.

The Berlin Emotional Speech Dataset [12] contains recordings of 10 different
utterances by 10 different actors in 7 different emotions (neutral, anger,
fear, joy, sadness, disgust and boredom) in German, for a total of 800
utterances (counting second versions of some of the sentences). This database
was, like the previous ones, validated using perception experiments.

These databases are not suitable for current state-of-the-art speech synthesis
because of the limited number of sentences recorded.

Moreover, the six basic emotions do not really occur in daily conversations.
Indeed, in Ekman’s model, on which the choice of emotions was based for these
datasets, the basic emotions are the ones from which other emotions derive. But
that does not necessarily mean that they are frequently expressed in speech in
our daily interactions.

1 https://github.com/numediart/EmoV-DB

The IMPROV [13] and IEMOCAP [14] databases both contain a large number of
diverse emotional sentences. IEMOCAP contains audio-visual recordings of
5 sessions of dyadic conversations between a male and a female subject.
In total it contains 10 speakers and 12 hours of data. IMPROV contains
6 sessions from 12 actors, resulting in 9 hours of audiovisual data. Both
databases were evaluated by several subjects in terms of emotion categories [6]
and emotional dimensions [7]. However, they are not suitable for synthesis
either: although the data is well recorded and post-processed, it contains
overlapping speech, due to the recording setup (dyadic conversation), and some
external noise.

The CMU Arctic Speech Database [15] and the SIWIS French Speech Synthesis
Database [16] are collections of read utterances of phonetically balanced
sentences in English and French respectively. The CMU-Arctic database contains
approximately 1150 sentences recorded from each of 4 different speakers,
while SIWIS contains a total of 9750 utterances from a single speaker. These
databases are suitable for speech synthesis, as they contain a large number
of different sentences recorded from a single speaker in a noiseless
environment. However, the sentences are neutral and do not express any emotion.

The AmuS database contains audio data dedicated to amused speech
synthesis [17]. We showed in previous work [18-20] that this database is well
suited for amused speech synthesis. However, AmuS contains data only for amused
speech and not for other emotions.


3 Motivations 

The primary purpose of this database² is to build models that can not only
produce emotional speech but also control the emotional dimension in speech
[21,22]. The techniques that allow this are either text-to-speech-like systems,
where the system maps a given text sentence to a speech audio signal, or voice
transformation systems, where a source voice is converted to a specific target
emotional voice. Both approaches require a lot of data.
One of the primary difficulties in building emotional speech generation
systems is the collection of data. Indeed, not only must the recordings be of
good quality and noise-free, but expressing emotional sentences in large
enough quantities is also challenging. It is also often preferable for these
types of systems that the data within a given emotion category be acoustically
similar.

The database presented here was built with these requirements in mind.
The aim was also for it to fit with other currently open-source databases to
maximize the quantity of data available. As mentioned previously, the
CMU-Arctic (English) and SIWIS (French) databases are two datasets of neutral
speech. Each of them contains a relatively large amount of data that can be
used as source voices for a voice conversion system or as pre-training data for
a system. They are also transcribed, and both the transcribed utterances and
annotations at the phonetic level are available. A subset of these sentences
was used to build our database, which makes the transcriptions available for
our database as well. The phonetic annotations are not yet time-aligned with
our data, but methods such as forced alignment [23] can be used for this.

2 https://github.com/numediart/EmoV-DB

We chose five different emotions: amusement, anger, sleepiness, disgust and
neutral. We chose emotions that are more likely to be expressed in daily
conversations than Ekman’s basic emotions. These emotions were also chosen
because they are easy for actors to produce and because they cover a diverse
region of the Russell circumplex, which allows experimenting with interpolation
techniques to obtain intermediate emotions.
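
As an illustration of what such an interpolation could look like, the sketch
below linearly interpolates between two points of a valence-arousal plane. It
is only one possible reading of "interpolation": the function name, the
(valence, arousal) tuple layout and the coordinates are our assumptions, and
the coordinates are placeholders rather than measured positions of any emotion.

```python
from typing import Tuple

Point = Tuple[float, float]  # assumed layout: (valence, arousal)


def interpolate(a: Point, b: Point, t: float) -> Point:
    """Linearly interpolate between two points; t=0 gives a, t=1 gives b."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be in [0, 1]")
    return (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))


# Placeholder coordinates, chosen only for illustration (not measured values).
source = (0.0, 0.0)
target = (1.0, 1.0)

# A point halfway between the two placeholder positions.
halfway = interpolate(source, target, 0.5)
print(halfway)
```

In a synthesis or voice transformation system, the interpolated point would
typically be used as a conditioning input; how that is done depends on the
model and is outside the scope of this sketch.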


4 Database Content 

The data was recorded in 2 different languages: English (North American) and
French (Belgian). Native English speakers (2 females and 2 males) and a single
male native French speaker were asked to read sentences while expressing one of
the above-mentioned emotions. The English sentences were taken from the
CMU-Arctic database and the French ones from the SIWIS database. Both databases
contain freely available, open-source, phonetically balanced sentences.

The recordings for the English data were carried out in two different anechoic
chambers on the Northeastern University campus. Those for the French data
were made in an anechoic room at the University of Mons.

The utterances were recorded in several sessions of about 30 minutes of
recording, each followed by a 5 to 15 minute break, and the data collection was
spread across several days depending on the availability of the actors. The
actors were asked to repeat sentences that were mispronounced.

The actors were asked to record each emotion class separately in different
sessions. At the time of writing, the sentences have been segmented manually
for some of the speakers (annotation and segmentation are still ongoing).
By segmentation we mean determining the start and end times of each
sentence. The number of utterances obtained is summarized in Table 1.


Table 1. Gender and language of recorded sentences from each speaker and amount 
of utterances segmented per speaker and per emotion. 


Speaker   Gender  Language  Neutral  Amused  Angry  Sleepy  Disgust
Spk-Je    Female  English   417      222     523    466     189
Spk-Bea   Female  English   373      309     317    520     347
Spk-Sa    Male    English   493      501     468    495     497
Spk-Jsli  Male    English   302      298     -      263     -
Spk-No    Male    French    317      -       273    -       -

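
The per-speaker and per-emotion totals implied by Table 1 can be computed as
in the sketch below. The counts are transcribed from the table; the variable
names are ours, and treating a "-" cell as contributing no segmented
utterances is an assumption about how those cells should be read.

```python
# Utterance counts transcribed from Table 1. None stands for a "-" cell,
# which is assumed here to mean that no segmented utterances are available.
EMOTIONS = ["Neutral", "Amused", "Angry", "Sleepy", "Disgust"]
COUNTS = {
    "Spk-Je":   [417, 222, 523, 466, 189],
    "Spk-Bea":  [373, 309, 317, 520, 347],
    "Spk-Sa":   [493, 501, 468, 495, 497],
    "Spk-Jsli": [302, 298, None, 263, None],
    "Spk-No":   [317, None, 273, None, None],
}

# Total per speaker, skipping missing cells.
per_speaker = {
    speaker: sum(c for c in row if c is not None)
    for speaker, row in COUNTS.items()
}

# Total per emotion, skipping missing cells.
per_emotion = {
    emotion: sum(row[i] for row in COUNTS.values() if row[i] is not None)
    for i, emotion in enumerate(EMOTIONS)
}

total = sum(per_speaker.values())

print(per_speaker)
print(per_emotion)
print(total)
```
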
Amused speech can contain chuckling sounds that overlap and/or intermingle
with speech, called speech-laughs [24], or it can consist only of amused smiled
speech [5]. For the amused data in our database, in order to collect as much
data as possible given the relatively limited time the actors could give us, we
focused on amused speech with speech-laughs. This choice was motivated by our
previous study showing that this type of amused speech is perceived as more
amused than amused smiled speech (without speech-laughs). In another study, we
also showed that synthesized speech that includes laughter is always perceived
as amused, no matter the style of speech it is inserted in (neutral or
smiled) [20]. Based on these previous studies on amusement, the actors were
encouraged, while simulating the other emotions, to use nonverbal
expressions [25] before and even while uttering the sentences if they felt the
need to (e.g. yawning for sleepiness, affect bursts for anger and disgust).